Cross-lingual adaptation, a special case of domain adaptation, refers to the transfer of
classification knowledge between two languages. In this article we describe an extension of
Structural Correspondence Learning (SCL), a recently proposed algorithm for domain adaptation,
to cross-lingual adaptation. The proposed method uses unlabeled documents from both languages,
along with a word translation oracle, to induce cross-lingual feature correspondences. From
these correspondences a cross-lingual representation is created that enables the transfer of
classification knowledge from the source to the target language. The main advantages of this
approach over other approaches are its resource efficiency and task specificity.

We conduct experiments in the area of cross-language topic and sentiment classification involv- 
ing English as source language and German, French, and Japanese as target languages. The results 
show a significant improvement of the proposed method over a machine translation baseline, re- 
ducing the relative error due to cross-lingual adaptation by an average of 30% (topic classification) 
and 59% (sentiment classification). We further report on empirical analyses that reveal insights 
into the use of unlabeled data, the sensitivity with respect to important hyperparameters, and 
the nature of the induced cross-lingual correspondences. 

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval - information filtering; I.2.7 [Artificial Intelligence]: Natural Language
Processing - Text analysis

General Terms: Cross-language text classification, cross-lingual adaptation 

Additional Key Words and Phrases: Structural Correspondence Learning, cross-language senti- 
ment analysis 



1. INTRODUCTION 

Over the past two decades supervised machine learning methods have been success- 
fully applied to many problems in natural language processing (e.g., named entity 
recognition, relation extraction, sentiment analysis) and information retrieval (e.g., 
text classification, information filtering). These methods, however, rely on large, 
annotated training corpora, whose acquisition is time-consuming, costly, and inher- 
ently language-specific. As a consequence most of the available training corpora are 
in English only. Since an ever increasing fraction of the textual content available in 
digital form is written in languages other than English,^ this limits the widespread
application of state-of-the-art techniques from natural language processing (NLP)
and information retrieval (IR). Technology for cross-lingual adaptation aims to
overcome this problem by transferring the knowledge encoded within annotated
(= labeled) data written in a source language to create a classifier for a different
target language. Cross-lingual adaptation can thus be viewed as a special case of
domain adaptation, where each language acts as a separate domain.

^This is especially the case for the World Wide Web, where from 2000 to 2009 the content
available in Chinese grew more than four times as much as the content available in English

In contrast to "classical" domain adaptation, cross-lingual adaptation is charac- 
terized by the fact that the two domains, i.e., the languages, have non-overlapping 
feature spaces, which has both theoretical and practical implications for domain 
adaptation. In classical domain adaptation — as well as in related problems such as 
covariate shift — the factor of overlapping feature spaces is implicitly presumed by 
the following or similar assumptions: (1) generalizable features, i.e., features which 
behave similarly in both domains, exist [Jiang and Zhai 2007; Blitzer et al. 2006;
Daumé 2007], or (2) the support of the test data distribution is contained in the
support of the training data distribution [Bickel et al. 2009]. If, on the other hand,
the feature sets are non-overlapping, one needs external knowledge to link features
of the source domain and the target domain [Dai et al. 2008].

This article extends the work of Prettenhofer and Stein [2010] and presents an 
approach for cross-lingual adaptation in the context of text classification: Cross- 
Language Structural Correspondence Learning (CL-SCL). CL-SCL uses unlabeled 
data from both languages along with external domain knowledge in the form of 
a word translation oracle to induce cross-lingual word correspondences. The ap- 
proach is based on Structural Correspondence Learning (SCL), a recently proposed 
algorithm for domain adaptation in natural language processing. 

Similar to SCL, CL-SCL induces correspondences among the words from both
languages using a small number of so-called pivots. In CL-SCL, a pivot is a pair
of words, $\{w_S, w_T\}$, from the source language S and the target language T, which
have similar semantics. Testing the occurrence of $w_S$ or $w_T$ in a set of unlabeled
documents from S and T yields two equivalence classes across these languages: one
class contains the documents where either $w_S$ or $w_T$ occurs, the other class contains
the documents where neither $w_S$ nor $w_T$ occurs. Ideally, a pivot splits the set of
unlabeled documents with respect to the semantics that is associated with $\{w_S, w_T\}$.
The correlation between $w_S$ or $w_T$ and the other words $w$, $w \notin \{w_S, w_T\}$, is modeled
by a linear classifier, which is then used as a language-independent predictor for
the two equivalence classes. A small number of pivots can capture a sufficiently
large part of the correspondences between S and T in order to (1) construct a
cross-lingual representation and (2) learn a classifier that operates on this representation.
Several advantages follow from this approach:

- Task specificity. The approach exploits the words' pragmatics since it considers,
during the pivot selection step, task-specific characteristics of language use.

- Efficiency in terms of linguistic resources. The approach uses unlabeled documents
from both languages along with a small number (100 to 500) of translated words,
instead of employing a parallel corpus or an extensive bilingual dictionary.

- Efficiency in terms of computing resources. The approach solves the classification
problem directly, instead of resorting to a more general and potentially much
harder problem such as machine translation.

(http://www.internetworldstats.com/stats7.htm, June 2010).

Fig. 1. A taxonomy of transfer learning settings, organized along the dimensions "domain"
and "task". The domain adaptation branch is unfolded.

The article is organized as follows: Section 2 discusses cross-lingual adaptation 
in the context of related work including domain adaptation and dataset shift. Sec- 
tion 3 introduces the problem of cross-language text classification, a special case 
of domain adaptation. Section 4 describes Cross-Language Structural Correspon- 
dence Learning and proposes a new regularization schema for the pivot predictors. 
Section 5 reports on the design and the results of experiments in the area of cross- 
language sentiment and topic classification. Finally, Section 6 concludes our work. 

2. RELATED WORK 

The idea to transfer knowledge from a source learning setting S to a different target
learning setting T is an active field of research [Pan and Yang 2009], and Figure 1
organizes well-known problems within a taxonomy. The taxonomy combines the
two most important determinants within a learning setting, namely, the domain
and the task. A domain is defined by (1) a set of features $M$, (2) a space of possible
feature vector realizations $X$, which typically is $\mathbb{R}^{|M|}$, and (3) a probability
distribution $P(x)$ over the space of possible feature vector realizations.^ A task
specifies a set of labels corresponding to classes, typically $\{+1, -1\}$, along with a
conditional distribution $P(y \mid x)$, with $y \in \{+1, -1\}$. Alternatively, a task can be
specified by a sample $\{(x, y) \mid x \in X, y \in \{+1, -1\}\}$. In Figure 1 the domain
adaptation branch is unfolded since it is the focus of this article. The branch
"different distributions" addresses problems where the feature sets are unchanged;
without loss of generality $P_S(x) \neq P_T(x)$ can also be presumed for problems in the
branch "different feature space".



^If clear without ambiguity, we use x or y to denote both a realization and a random variable.

2.1 Domain Adaptation

Domain adaptation refers to the problem of adapting a statistical classifier trained
on data from one (or more) source domains to a different target domain. In the
basic domain adaptation setting we are given labeled data from a source domain S
and unlabeled data from the target domain T, and the goal is to train a classifier
for the target domain. Beyond this setting one can further distinguish whether a
small amount of labeled data from the target domain is available [Daumé 2007;
Finkel and Manning 2009] or not [Blitzer et al. 2006; Jiang and Zhai 2007]. The
latter setting is referred to as unsupervised domain adaptation.

Blitzer et al. [2006] propose an effective algorithm for unsupervised domain adap- 
tation, called Structural Correspondence Learning. Within a first step SCL iden- 
tifies features that generalize across domains, which the authors call pivots. SCL 
then models the correlation between the pivots and all other features by training 
linear classifiers on the unlabeled data from both domains. This information is used 
to induce correspondences among features from the different domains and to learn 
a shared representation that is meaningful across both domains. SCL is related to 
the structural learning paradigm introduced by Ando and Zhang [2005a]. The basic
idea of structural learning is to constrain the hypothesis space of a learning task by
considering multiple different but related tasks on the same input space. Ando and
Zhang [2005b] present a semi-supervised learning method, Alternating Structural
Optimization (ASO), based on this paradigm, which generates related tasks from
unlabeled data. They show that ASO delivers state-of-the-art performance for a
variety of natural language processing tasks including named entity chunking and
syntactic chunking. Quattoni et al. [2007] apply structural learning to image
classification in settings where little labeled data is given.

2.2 Dataset Shift 

Traditional machine learning assumes that both training and test examples are
drawn from identical distributions. In practice this assumption is often violated,
for instance due to the irreproducibility of the test conditions within the training
phase. Dataset shift refers to the general problem where the joint distribution of
inputs and outputs differs between training phase and test phase. The difference
between dataset shift and domain adaptation is subtle; in fact, both refer to the
same underlying problem but emerge from the viewpoints of different research
communities. The term dataset shift was coined by the machine learning community and
builds on prior work in statistics, in particular the work on covariate shift [Shimodaira
2000] and sample selection bias [Cortes et al. 2008]. In contrast, domain adaptation
originates from the natural language processing community. Most of the
early work on domain adaptation focuses on the question of how to leverage
"out-domain data" (= data associated with S) effectively to learn a classifier when only
little or no labeled "in-domain data" (= data associated with T) is available. The
latter emphasizes the relationship to semi-supervised learning, with the crucial
difference that labeled and unlabeled data stem from different distributions.
Covariate shift can be considered a special case of dataset shift that is closely
related to unsupervised domain adaptation. It is characterized by the fact that the
class conditional distribution between training phase and test phase is equal, i.e.,
$P_S(y \mid x) = P_T(y \mid x)$, while the marginal distribution of the inputs (covariates)
differs, i.e., $P_S(x) \neq P_T(x)$. A broad discussion of dataset shift is beyond the scope
of this article; the interested reader is referred to [Quiñonero-Candela et al. 2009].

2.3 Cross-Lingual Adaptation 

Analogous to domain adaptation, cross-lingual adaptation refers to the problem
of adapting a statistical classifier trained on data from a source language S to a
different target language T. Examples include the adaptation of a named-entity
recognizer, a syntactic parser, or a relation extractor. The major characteristic of
cross-lingual adaptation is the fact that the two "domains" have non-overlapping
feature sets, i.e., $M_S \neq M_T$. While cross-lingual adaptation in general has not received
a lot of attention in the natural language processing community, a special case of
cross-lingual adaptation has received a lot of attention recently: cross-language
text classification, which is also the focus of this article.

Bel et al. [2003] were among the first to explicitly consider the problem of
cross-language text classification. Their research, however, is predated by work in
cross-language information retrieval (CLIR), where similar problems are addressed
[Oard 1998]. Traditional approaches to cross-language text classification and CLIR
use linguistic resources such as bilingual dictionaries or parallel corpora to induce
correspondences between two languages [Lavrenko et al. 2002; Olsson et al. 2005].
Dumais et al. [1997] is considered seminal work in CLIR: they propose a method
that induces semantic correspondences between two languages by performing latent
semantic analysis (LSA) on a parallel corpus. Li and Taylor [2007] improve
upon this method by employing kernel canonical correlation analysis (CCA) instead
of LSA. The major limitation of these approaches is their computational complexity
and, in particular, the dependence on a parallel corpus, which is hard to obtain,
especially for less resource-rich languages. Gliozzo and Strapparava [2005] circumvent
the dependence on a parallel corpus by using so-called multilingual domain
models, which can be acquired from comparable corpora in an unsupervised manner.
In [Gliozzo and Strapparava 2006] they show for particular tasks that their
approach can achieve a performance close to that of monolingual text classification.

Recent work in cross-language text classification focuses on the use of automatic 
machine translation technology. Most of these methods involve two steps: (1) trans- 
lation of the documents into the source or the target language, and (2) dimension- 
ality reduction or semi-supervised learning to reduce the noise introduced by the 
machine translation. Methods which follow this two-step approach include the
EM-based approach by Rigutini et al. [2005], the CCA approach by Fortuna and
Shawe-Taylor [2005], the information bottleneck approach by Ling et al. [2008], and
the co-training approach by Wan [2009].

3. CROSS-LANGUAGE TEXT CLASSIFICATION 

In standard text classification, a document $d$ is represented under the bag-of-words
model as a $|V|$-dimensional feature vector $x \in X$, where $V$, the vocabulary, denotes
an ordered set of words, $x_i \in x$ denotes the normalized frequency of word $i$ in $d$, and
$X$ is an inner product space. $D_S$ denotes the training set and comprises tuples of
the form $(x, y)$, which associate a feature vector $x \in X$ with a class label $y \in Y$. For
simplicity but without loss of generality we assume binary classification problems,
$Y = \{+1, -1\}$. The goal is to find a classifier $f : X \to Y$ that predicts the labels of
new, previously unseen documents. In the following, we restrict ourselves to linear
classifiers:

$$f(x) = \mathrm{sign}(w^T x), \tag{1}$$

where $w$ is a weight vector that parameterizes the classifier and $[\cdot]^T$ denotes the
matrix transpose. The computation of $w$ from $D_S$ is referred to as model estimation
or training. A common choice for $w$ is given by a vector $w^*$ that minimizes the
regularized training error:

$$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^{|V|}} \sum_{(x, y) \in D_S} L(y, w^T x) + \lambda R(w). \tag{2}$$

$L$ is a loss function that measures the quality of the classifier, $R$ is a regularization
term that penalizes model complexity, and $\lambda$ is a non-negative hyperparameter that
models the tradeoff between classification performance and model complexity. A
common choice for $R$ is L2 regularization, which imposes an L2-norm penalty on $w$,
$R(w) = \frac{1}{2} \|w\|_2^2 = \frac{1}{2} w^T w$. Different choices for $L$ entail different classifier types;
e.g., when choosing the hinge loss function one obtains the popular Support Vector
Machine classifier [Zhang 2004].
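To make Equations (1) and (2) concrete, the following minimal sketch evaluates a linear classifier and its regularized training error with the hinge loss and L2 regularization. The toy data, the weight vector, and the function names are illustrative assumptions, not part of the original experiments; the convention sign(0) = +1 is likewise an assumption.

```python
import numpy as np

def predict(w, X):
    """Equation (1): sign of the linear score w^T x for each row of X (sign(0) := +1)."""
    return np.where(X @ w >= 0.0, 1, -1)

def hinge_loss(y, p):
    """Hinge loss for labels y in {+1, -1} and scores p = w^T x."""
    return np.maximum(0.0, 1.0 - y * p)

def regularized_error(w, X, y, lam):
    """Objective of Equation (2) with R(w) = 0.5 * w^T w."""
    scores = X @ w
    return hinge_loss(y, scores).sum() + lam * 0.5 * (w @ w)

# Toy data (illustrative only): 4 documents over a 3-word vocabulary.
X = np.array([[1.0, 0.0, 0.0],
              [0.8, 0.2, 0.0],
              [0.0, 0.1, 0.9],
              [0.0, 0.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([2.0, 0.0, -2.0])

labels = predict(w, X)
objective = regularized_error(w, X, y, lam=0.1)
```

In this toy setting every document is classified with a margin of at least 1, so the loss term vanishes and the objective consists of the regularization term alone.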

Standard text classification distinguishes between labeled (training) documents
and unlabeled (test) documents. Cross-language text classification poses an extra
constraint in that training documents and test documents are written in different
languages. Here, the language of the training documents is referred to as source
language S, and the language of the test documents is referred to as target language T.
The vocabulary $V$ divides into $V_S$ and $V_T$, called vocabulary of the source language
and vocabulary of the target language, with $V_S \cap V_T = \emptyset$. That is, documents from the
training set and the test set map onto non-overlapping regions of the feature space.
Thus, a linear classifier trained on $D_S$ associates non-zero weights only with words
from $V_S$, which in turn means that it cannot be used to classify documents written
in T.
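The consequence of disjoint vocabularies can be illustrated with a small sketch. The four-word joint vocabulary, the weights, and the assumption that target-word weights stay at a zero initial value are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical joint vocabulary: V_S followed by V_T (disjoint by construction).
vocab = ["excellent", "boring", "exzellent", "langweilig"]
source_idx = [0, 1]   # positions of the source vocabulary V_S
target_idx = [2, 3]   # positions of the target vocabulary V_T

# A classifier trained on D_S only observes source words. Assuming zero
# initialisation, the weights of the target words therefore remain zero.
w = np.zeros(len(vocab))
w[source_idx] = [1.5, -1.5]

x_source = np.array([1.0, 0.0, 0.0, 0.0])  # English document containing "excellent"
x_target = np.array([0.0, 0.0, 1.0, 0.0])  # German document containing "exzellent"

score_source = float(w @ x_source)  # informative score
score_target = float(w @ x_target)  # always zero: no evidence reaches the classifier
```

Every target-language document receives the score zero, regardless of its content, which is the "feature barrier" in its simplest form.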

One way to overcome this "feature barrier" is to find a cross-lingual representation
for documents written in S and T, which enables the transfer of classification
knowledge between the two languages. Intuitively, one can understand such a
cross-lingual representation as a concept space that underlies both languages. In the
following, we will use $\theta$ to denote a map that associates the original $|V|$-dimensional
representation of a document $d$ written in S or T with its cross-lingual representation.
Once such a mapping is found the cross-language text classification problem
reduces to a standard classification problem in the cross-lingual space. Note that
the existing methods for cross-language text classification can be characterized by
the way $\theta$ is constructed. For instance, cross-language latent semantic indexing
[Dumais et al. 1997] and cross-language explicit semantic analysis [Potthast et al.
2008] estimate $\theta$ using a parallel corpus. Other methods use linguistic resources
such as a bilingual dictionary to obtain $\theta$ [Bel et al. 2003; Olsson et al. 2005; Wu
et al. 2008].
4. CROSS-LANGUAGE STRUCTURAL CORRESPONDENCE LEARNING

We now present a method for learning a map $\theta$ by exploiting relations from
unlabeled documents written in S and T. The proposed method, which we call
cross-language structural correspondence learning (CL-SCL), addresses the following
learning setup (see also Figure 2):

(1) Given a set of labeled training documents $D_S$ written in language S, the goal
is to create a text classifier for documents written in a different language T. We
refer to this classification task as the target task. An example for the target task is
the determination of sentiment polarity, either positive or negative, of book reviews
written in German (T) given a set of training reviews written in English (S).

(2) In addition to the labeled training documents $D_S$ we have access to unlabeled
documents $D_{S,u}$ and $D_{T,u}$ from both languages S and T. Let $D_u$ denote
$D_{S,u} \cup D_{T,u}$.

(3) Finally, we are given a budget of calls to a word translation oracle (e.g., a
domain expert) to map words in the source vocabulary $V_S$ to their corresponding
translations in the target vocabulary $V_T$. For simplicity and without loss of
applicability we assume here that the word translation oracle maps each word in $V_S$ to
exactly one word in $V_T$.

CL-SCL comprises three steps: In the first step, CL-SCL selects word pairs
$\{w_S, w_T\}$, called pivots, where $w_S \in V_S$ and $w_T \in V_T$. Pivots have to satisfy the
following conditions:

Confidence. Both words, $w_S$ and $w_T$, are predictive for the target task.

Support. Both words, $w_S$ and $w_T$, occur frequently in $D_{S,u}$ and $D_{T,u}$, respectively.

The confidence condition ensures that, in the second step of CL-SCL, only those
correlations are modeled that are useful for discriminative learning. The support
condition, on the other hand, ensures that these correlations can be estimated
accurately. Considering our sentiment classification example, the word pair
$\{\text{excellent}_S, \text{exzellent}_T\}$ satisfies both conditions: (1) the words are strong
indicators of positive sentiment, and (2) the words occur frequently in book reviews
from both languages. Note that the support of $w_S$ and $w_T$ can be determined from
the unlabeled data $D_u$. The confidence, however, can only be determined for $w_S$ since
the setting gives us access to labeled data from S only.

Fig. 2. The document sets underlying CL-SCL. The subscripts S, T, and u designate "source
language", "target language", and "unlabeled".

We use the following heuristic to form an ordered set $P$ of pivots: First, we
choose a subset $V_P$ from the source vocabulary $V_S$, $|V_P| \ll |V_S|$, which contains
those words with the highest mutual information with respect to the class label of
the target task in $D_S$. Second, for each word $w_S \in V_P$ we find its translation in the
target vocabulary $V_T$ by querying the translation oracle; we refer to the resulting
set of word pairs as the candidate pivots, $P'$:

$$P' = \{\{w_S, \text{TRANSLATE}(w_S)\} \mid w_S \in V_P\}.$$

We then enforce the support condition by eliminating in $P'$ all candidate pivots
$\{w_S, w_T\}$ where the document frequency of $w_S$ in $D_{S,u}$ or of $w_T$ in $D_{T,u}$ is smaller
than some threshold $\phi$:

$$P = \text{CANDIDATEELIMINATION}(P', \phi).$$

Let $m$ denote $|P|$, the number of pivots.
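The pivot selection heuristic can be sketched as follows. This is an illustrative sketch under several assumptions: documents are represented as sets of words, the translation oracle is modeled as a plain dictionary named `translate`, and the mutual information is estimated from word presence only. None of these choices are prescribed by the text, and the toy corpus is invented.

```python
import math
from collections import Counter

def mutual_information(docs, labels, word):
    """Mutual information (in nats) between the presence of `word` and the class label."""
    n = len(docs)
    joint = Counter((word in d, y) for d, y in zip(docs, labels))
    p_w, p_y = Counter(), Counter()
    for (present, y), c in joint.items():
        p_w[present] += c
        p_y[y] += c
    return sum((c / n) * math.log(c * n / (p_w[present] * p_y[y]))
               for (present, y), c in joint.items())

def document_frequency(word, docs):
    return sum(word in d for d in docs)

def select_pivots(labeled_src, labels, unl_src, unl_tgt, translate, n_candidates, phi):
    """Rank source words by mutual information, translate them, and keep the supported pairs."""
    vocab = set().union(*labeled_src)
    ranked = sorted(vocab, key=lambda w: (-mutual_information(labeled_src, labels, w), w))
    candidates = [(w, translate[w]) for w in ranked[:n_candidates] if w in translate]
    # Support condition: drop a pair if either word is rarer than phi.
    return [(ws, wt) for ws, wt in candidates
            if document_frequency(ws, unl_src) >= phi
            and document_frequency(wt, unl_tgt) >= phi]

# Toy corpus (hypothetical).
labeled_src = [{"excellent", "book"}, {"excellent", "story"},
               {"boring", "book"}, {"boring", "story"}]
labels = [1, 1, -1, -1]
unl_src = [{"excellent", "plot"}, {"excellent"}, {"boring"}]
unl_tgt = [{"exzellent"}, {"exzellent", "buch"}, {"buch"}]
translate = {"excellent": "exzellent", "boring": "langweilig", "book": "buch"}

pivots = select_pivots(labeled_src, labels, unl_src, unl_tgt, translate,
                       n_candidates=2, phi=1)
```

Here "boring" and "excellent" are the two most informative source words, but the translation of "boring" never occurs in the unlabeled target documents, so only the pair (excellent, exzellent) survives the elimination step.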

In the second step, CL-SCL models the correlations between each pivot
$\{w_S, w_T\} \in P$ and all other words $w \in V \setminus \{w_S, w_T\}$. This is done by training
linear classifiers that predict whether or not $w_S$ or $w_T$ occur in a document,
based on the other words. For this purpose a training set $D_l$ is created for each
pivot $p_l \in P$:

$$D_l = \{(\text{MASK}(x, p_l), \text{IN}(x, p_l)) \mid x \in D_u\}.$$

$\text{MASK}(x, p_l)$ is a function that returns a copy of $x$ where the components
associated with the two words in $p_l$ are set to zero, which is equivalent to removing
these words from the feature space. $\text{IN}(x, p_l)$ returns $+1$ if one of the components
of $x$ associated with the words in $p_l$ is non-zero and $-1$ otherwise. For each $D_l$
a linear classifier, characterized by the parameter vector $w_l$, is trained by
minimizing Equation (2) on $D_l$. Note that each training set $D_l$ contains documents
from both languages. Thus, for a pivot $p_l = \{w_S, w_T\}$ the vector $w_l$ captures both
the correlation between $w_S$ and $V_S \setminus \{w_S\}$ and the correlation between $w_T$ and
$V_T \setminus \{w_T\}$.
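The construction of a pivot training set can be sketched in a few lines. The sketch assumes that documents are rows of a dense NumPy array over the joint vocabulary and that a pivot is given as a pair of column indices; the function names and the toy data are illustrative.

```python
import numpy as np

def mask(x, pivot):
    """MASK(x, p): copy of x with the components of both pivot words set to zero."""
    masked = x.copy()
    masked[list(pivot)] = 0.0
    return masked

def contains(x, pivot):
    """IN(x, p): +1 if either pivot word occurs in x, otherwise -1."""
    return 1 if np.any(x[list(pivot)] != 0.0) else -1

def pivot_training_set(D_u, pivot):
    """Build the masked inputs and the occurrence labels for one pivot."""
    X = np.array([mask(x, pivot) for x in D_u])
    y = np.array([contains(x, pivot) for x in D_u])
    return X, y

# Toy unlabeled data over the joint vocabulary [excellent, read, exzellent, lesen];
# the pivot {excellent, exzellent} corresponds to the index pair (0, 2).
D_u = np.array([[0.7, 0.3, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 0.5, 0.5],
                [0.0, 0.0, 0.0, 1.0]])
X_l, y_l = pivot_training_set(D_u, pivot=(0, 2))
```

Because the pivot columns are zeroed in `X_l`, a classifier trained on `(X_l, y_l)` has to predict the pivot's occurrence from the remaining words, in both languages at once.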

In the third step, CL-SCL identifies correlations across pivots by computing the
singular value decomposition of the $|V| \times m$-dimensional parameter matrix $W$,
$W = [w_1 \ldots w_m]$:

$$U \Sigma V^T = \text{SVD}(W).$$

Recall that $W$ encodes the correlation structure between pivot and non-pivot
words in the form of multiple linear classifiers. Thus, the columns of $U$ identify
common substructures among these classifiers. Choosing the columns of $U$ associated
with the largest singular values yields those substructures that capture most
of the correlation in $W$. We define $\theta$ as those columns of $U$ that are associated
with the $k$ largest singular values:

$$\theta = U^T_{[1:k,\, 1:|V|]}.$$
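The third step can be sketched with NumPy as follows. The random matrix stands in for a trained parameter matrix, and the sizes are toy values chosen for illustration; `compute_theta` is a hypothetical helper name.

```python
import numpy as np

def compute_theta(W, k):
    """Return the k x |V| matrix whose rows are the top-k left singular vectors of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k].T   # NumPy returns the singular values in descending order

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))   # stand-in for W with |V| = 6 words and m = 4 pivots
theta = compute_theta(W, k=2)

x = rng.random(6)             # a document in the original feature space
z = theta @ x                 # its k-dimensional cross-lingual representation
```

The rows of `theta` are orthonormal, so the projection neither rescales nor mixes the selected directions.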
Algorithm 1 CL-SCL

Input:       Labeled source data $D_S$
             Unlabeled data $D_u = D_{S,u} \cup D_{T,u}$
Parameters:  $m$, $k$, $\lambda$, and $\phi$
Output:      $k \times |V|$-dimensional matrix $\theta$

1. selectPivots($D_S$, $m$)
     $V_P$ = mutualInformation($D_S$)
     $P' = \{\{w_S, \text{TRANSLATE}(w_S)\} \mid w_S \in V_P\}$
     $P$ = CANDIDATEELIMINATION($P'$, $\phi$)

2. trainPivotPredictors($D_u$, $P$)
     for $l = 1$ to $m$ do
       $D_l = \{(\text{MASK}(x, p_l), \text{IN}(x, p_l)) \mid x \in D_u\}$
       $w_l = \operatorname*{argmin}_{w \in \mathbb{R}^{|V|}} \sum_{(x, y) \in D_l} L(y, w^T x) + \lambda R(w)$
     end for
     $W = [w_1 \ldots w_m]$

3. computeSVD($W$, $k$)
     $U \Sigma V^T = \text{SVD}(W)$
     $\theta = U^T_{[1:k,\, 1:|V|]}$

output $\{\theta\}$



Algorithm 1 summarizes the three steps of CL-SCL. At training and test time,
we apply the projection $\theta$ to each input instance $x$. The vector $v^*$ that minimizes
the regularized training error for $D_S$ in the projected space is defined as follows:

$$v^* = \operatorname*{argmin}_{v \in \mathbb{R}^k} \sum_{(x, y) \in D_S} L(y, v^T \theta x) + \lambda R(v). \tag{3}$$

The resulting classifier, which will operate in the cross-lingual setting, is defined
as follows:

$$f(x) = \mathrm{sign}(v^{*T} \theta x).$$

4.1 An Alternative View of CL-SCL

An alternative view of cross-language structural correspondence learning is provided
by the framework of structural learning [Ando and Zhang 2005a]. The basic
idea of structural learning is to constrain the hypothesis space, i.e., the space of
possible weight vectors, of the target task by considering multiple different but
related prediction tasks. In our context these auxiliary tasks are represented by the
pivot predictors, i.e., the columns of $W$. Each column vector $w_l$ can be considered
as a linear classifier which performs well in both languages. Thus, we can regard
the column space of $W$ as an approximation to the subspace of bilingual classifiers.

10 • Prettenhofer, Stein 




Fig. 3. Illustration of the subspace constraint for $|V| = 3$ and $k = 2$. The plane spanned by $\theta_1$
and $\theta_2$ shows the subspace induced by the two left singular vectors of $W = [w_1\ w_2\ w_3]$ associated
with the largest singular values. For the target task, we restrict the weight vector $w$ to lie in the
subspace of the parameter space defined by $\theta^T$, $w = \theta^T v$.

By computing SVD($W$) one obtains a compact representation of this column space
in the form of an orthonormal basis $\theta^T$.

The subspace is used to constrain the learning of the target task by restricting
the weight vector $w$ to lie in the subspace defined by $\theta^T$. Following Ando and
Zhang [2005a] and Quattoni et al. [2007] we choose $w$ for the target task to be
$w^* = \theta^T v^*$, where $v^*$ is defined as follows:

$$v^* = \operatorname*{argmin}_{v \in \mathbb{R}^k} \sum_{(x, y) \in D_S} L(y, (\theta^T v)^T x) + \lambda R(v). \tag{4}$$

Since $(\theta^T v)^T = v^T \theta$ it follows that this view of CL-SCL corresponds to the
induction of a new feature space given by Equation 3.

Figure 3 illustrates the basic idea of the subspace constraint for $|V| = 3$ and
$k = 2$.
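The equivalence of the two views can be checked numerically. In the sketch below, a random matrix with orthonormal rows stands in for a learned projection, and the sizes are toy values; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2 x 5 matrix with orthonormal rows, standing in for a learned projection.
theta = np.linalg.qr(rng.normal(size=(5, 2)))[0].T
v = rng.normal(size=2)    # weight vector in the k-dimensional space
x = rng.random(5)         # document in the original |V|-dimensional space

w = theta.T @ v                       # weight vector constrained to the subspace
score_constrained = w @ x             # (theta^T v)^T x, the score used in Equation (4)
score_projected = v @ (theta @ x)     # v^T (theta x), the score used in Equation (3)
```

Both scores coincide up to floating-point error, so constraining the weight vector to the subspace and training on projected documents describe the same classifier.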

4.2 Computational Considerations 

While the second step of CL-SCL involves the training of a fairly large number of
linear classifiers, these classifiers can be learned very efficiently due to (1) efficient
learning algorithms for linear classifiers [Shwartz et al. 2007] and (2) the fact that
learning the pivot classifiers is an embarrassingly parallel problem. The
computational bottleneck of the CL-SCL procedure is the SVD of the dense parameter
matrix $W$. In order to make the computation tractable, Ando and Zhang [2005a] as
well as Blitzer et al. [2007] propose to set negative entries in $W$ to zero, in order to
obtain a sparse matrix for which the SVD can be computed more efficiently [Berry
1992]. As a rationale for this step the authors claim that the involved features "yield
much less specific information" on the target concept, while "positive weights are
usually directly related to the target concept" [Ando and Zhang 2005a].
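The thresholding step described above amounts to a single element-wise operation. The following sketch uses an invented 3 x 3 matrix purely to show the effect on the number of non-zero entries; it does not include the sparse SVD itself.

```python
import numpy as np

# Invented parameter matrix for illustration.
W = np.array([[ 0.9, -0.2,  0.0],
              [-0.4,  0.7, -0.1],
              [ 0.3, -0.6,  0.5]])

W_sparse = np.maximum(W, 0.0)   # set all negative weights to zero

nonzero_before = int(np.count_nonzero(W))
nonzero_after = int(np.count_nonzero(W_sparse))
```

In this example the number of non-zero entries drops from 8 to 4; on a real parameter matrix the reduction depends on the sign distribution of the learned weights.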

We propose a different strategy to obtain a sparse parameter matrix $W$, namely
to enforce sparse pivot classifiers $w_l$ by employing a proper regularization term $R$ in
the second step of CL-SCL. A straightforward solution is to use L1 regularization
[Tibshirani 1996], which imposes an L1-norm penalty on $w$, $R(w) = \|w\|_1 = \sum_i |w_i|$.
This strategy recently gained much attention in the natural language
processing community; Gao et al. [2007] show that L1 regularized models have
similar predictive power to L2 regularized models while being much smaller at the
same time, i.e., fewer parameters are non-zero.

L1 regularization, however, has properties which are inadequate in the context
of SCL, in particular its handling of highly correlated features. Zou and Hastie
[2005] show that if there is a subset of features among which the pairwise
correlations are high, L1 regularization tends to select only one feature while pushing
the other feature weights to zero. This is certainly not desirable for SCL since it
relies on the proper modeling of correlations in order to induce correspondences
among features. L2 regularization, by contrast, exhibits a grouping behavior,
resulting in equal weights for correlated features. The Elastic Net combines both
properties, the grouping behavior of L2 regularization and the sparsity property of
L1 regularization [Zou and Hastie 2005]. It is given by the convex combination of
both norms:

$$R(w) = \alpha \|w\|_2^2 + (1 - \alpha) \|w\|_1, \tag{5}$$

where $\alpha \in [0, 1]$ models the trade-off between grouping and sparsity. The Elastic
Net is widely used in bioinformatics, in particular in the study of gene expression;
however, its use for applications in natural language processing or information
retrieval has not been studied yet.
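As a small illustration of the Elastic Net penalty, the sketch below assumes that Equation (5) weights the squared L2 norm by the trade-off parameter and the L1 norm by its complement. The function name and the example vector are illustrative.

```python
import numpy as np

def elastic_net(w, alpha):
    """Assumed form of Equation (5): alpha * ||w||_2^2 + (1 - alpha) * ||w||_1."""
    return alpha * np.sum(w ** 2) + (1.0 - alpha) * np.sum(np.abs(w))

w = np.array([0.5, -0.5, 0.0, 2.0])

pure_l1 = elastic_net(w, alpha=0.0)   # L1 penalty only (sparsity)
pure_l2 = elastic_net(w, alpha=1.0)   # squared L2 penalty only (grouping)
mixed = elastic_net(w, alpha=0.5)     # equal mix of both penalties
```

Under this assumed form, the two extreme settings of the trade-off parameter recover plain L1 and plain squared-L2 regularization, and intermediate values interpolate linearly between them.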

5. EXPERIMENTS 

We evaluate CL-SCL for the task of cross-language sentiment and topic classifica- 
tion using English as source language and German, French, and Japanese as target 
languages. We first describe the experimental design and give implementation
details; we then present the evaluation results; finally, we report on detailed
analyses with respect to the nature of the induced cross-lingual correspondences,
the use of unlabeled data, and important hyperparameters including the impact of
different regularization methods.
5.1 Datasets

We use the cross-lingual sentiment dataset provided by Prettenhofer and Stein
[2010].^ The dataset contains Amazon product reviews for the three product
categories books, dvd, and music in the languages English, German, French, and
Japanese. Each document is labeled according to its sentiment polarity as either
positive or negative. The documents in the dataset are organized by language and
product category. For each language-category pair there are three balanced disjoint
sets of training, test, and unlabeled documents; the respective set sizes are 2,000,
2,000, and 9,000-50,000. Similar to Prettenhofer and Stein [2010], each document
d is represented as a normalized (unit length) feature vector x under a unigram
bag-of-words model. Based on this dataset we create two tasks (see Table I for
summary statistics):

Sentiment Classification Task. For the task of cross-language sentiment classifi- 
cation the original partitioning of the cross-lingual sentiment dataset is used. Anal- 
ogous to Prettenhofer and Stein [2010] English is employed as source language, and 
German, French, and Japanese as target languages. For each of the nine target- 
language-category-combinations a sentiment classification task is created by taking 
the training set and the unlabeled set for some product category from <S and the 
test set and the unlabeled set for the same product category from T. 

Topic Classification Task (Product Categories). For the task of cross-language 
topic classification we discard the original sentiment labels and use the product 
category, i.e., books, dvd, and music as the document label. Again we use English 
as the source language and German, French, and Japanese as target languages. 
Note that — in contrast to the sentiment classification tasks — classifying reviews 
according to product categories is a multi-class classification problem with three 
mutually exclusive classes. Hence for each of the three target languages a
cross-language topic classification task is created by combining the training set
and the unlabeled set of each product category from $\mathcal{S}$ with the test set
and the unlabeled set of each product category from $\mathcal{T}$. For each of the
three tasks we have 6,000 training and 6,000 test documents, each containing a
balanced number of examples.
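The document representation and the construction of a topic task described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiments: the function names, the whitespace-free token lists, the use of raw counts before normalization, and the toy corpus are assumptions. The pooling of unlabeled documents is omitted, since Table I reports smaller unlabeled sets for the topic tasks than a plain concatenation would give.

```python
import math
from collections import Counter

CATEGORIES = ("books", "dvd", "music")


def unit_bow(tokens):
    """Unigram count vector of one document, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm > 0 else {}


def make_topic_task(source, target):
    """Pool the per-category sets of one language pair into a single topic task.

    `source` and `target` map a category name to a dict with the keys
    "train" and "test", each holding a list of token lists (hypothetical layout).
    The product category becomes the document label.
    """
    train = [(unit_bow(d), c) for c in CATEGORIES for d in source[c]["train"]]
    test = [(unit_bow(d), c) for c in CATEGORIES for d in target[c]["test"]]
    return train, test


# Toy stand-in for the corpus: two documents per split instead of 2,000.
toy = {c: {"train": [[c, "review"]] * 2, "test": [[c, "kritik"]] * 2}
       for c in CATEGORIES}
train, test = make_topic_task(toy, toy)
```

With 2,000 documents per category and split, the same pooling yields the 6,000 training and 6,000 test documents stated above.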

5.2 Implementation 

Within all experiments we employ linear classifiers, which are trained by minimizing 
Equation (2) using a stochastic gradient descent (SGD) algorithm. In particular, 
we use the plain SGD algorithm as described by Zhang [2004] while adopting the 
learning rate schedule from PEGASOS [Shwartz et al. 2007]. Analogous to Blitzer 
et al. [2007] and Ando and Zhang [2005a] we employ as loss function L the modified 
Huber loss [Zhang 2004], a smoothed version of the hinge loss:

\[
L(y, p) =
\begin{cases}
\max(0,\, 1 - py)^2, & \text{if } py \geq -1 \\
-4py, & \text{otherwise.}
\end{cases}
\tag{6}
\]

^Available at http://www.webis.de/research/corpora/webis-cls-10/

Table I. Dataset statistics.

                    Unlabeled data      Labeled data    Vocabulary
T         Category  |D_S,u|   |D_T,u|   |D_S|   |D_T|   |V_S|    |V_T|
German    books     50,000    50,000    2,000   2,000   64,682   108,573
          dvd       30,000    50,000    2,000   2,000   52,822   103,862
          music     25,000    50,000    2,000   2,000   41,306   99,287
French    books     50,000    32,000    2,000   2,000   64,682   55,016
          dvd       30,000    9,000     2,000   2,000   52,822   29,519
          music     25,000    16,000    2,000   2,000   41,306   42,097
Japanese  books     50,000    50,000    2,000   2,000   64,682   52,311
          dvd       30,000    50,000    2,000   2,000   52,822   54,533
          music     25,000    50,000    2,000   2,000   41,306   54,463
German    -         60,000    60,000    6,000   6,000   76,629   124,529
French    -         60,000    45,000    6,000   6,000   76,629   74,807
Japanese  -         60,000    60,000    6,000   6,000   76,629   64,050

Summary statistics for the nine cross-language sentiment classification tasks (first
nine rows) and the three cross-language topic classification tasks (last three rows).
|D_S,u| and |D_T,u| give the number of unlabeled documents from S and T; |D_S| and
|D_T| give the number of training and test documents. All document sets are balanced.

SGD and related methods based on stochastic approximation have been successfully
applied to solve large-scale linear prediction problems in natural language
processing and information retrieval [Zhang 2004; Shwartz et al. 2007]. Their major
advantages are efficiency and ease of implementation.
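To make the training procedure concrete, the following Python sketch implements the modified Huber loss of Equation (6), its derivative, and a plain SGD loop with the PEGASOS-style learning rate $\eta_t = 1/(\lambda t)$. It is an illustration only, not the implementation used in the experiments: it assumes an L2-regularized objective of the form $\frac{\lambda}{2}\|\mathbf{w}\|^2$ plus the average loss, sparse documents given as index-to-value dicts, and uniform sampling; all function names are hypothetical.

```python
import random


def modified_huber(y, p):
    """Modified Huber loss for label y in {-1, +1} and prediction p."""
    z = p * y
    if z >= -1.0:
        return max(0.0, 1.0 - z) ** 2
    return -4.0 * z


def modified_huber_grad(y, p):
    """Derivative of the modified Huber loss with respect to the prediction p."""
    z = p * y
    if z >= 1.0:
        return 0.0
    if z >= -1.0:
        return -2.0 * y * (1.0 - z)
    return -4.0 * y


def sgd_train(data, dim, lam, n_iter, seed=0):
    """Plain SGD on an L2-regularized linear model (assumed objective, see text)."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for t in range(1, n_iter + 1):
        x, y = data[rng.randrange(len(data))]
        eta = 1.0 / (lam * t)                      # PEGASOS-style schedule
        p = sum(w[i] * v for i, v in x.items())
        g = modified_huber_grad(y, p)
        shrink = 1.0 - eta * lam                   # gradient step on the L2 term
        w = [shrink * wi for wi in w]              # dense for clarity, not speed
        for i, v in x.items():
            w[i] -= eta * g * v                    # gradient step on the loss
    return w


# Two linearly separable toy documents.
toy = [({0: 1.0}, +1), ({1: 1.0}, -1)]
w = sgd_train(toy, dim=2, lam=0.01, n_iter=1000)
```

Practical implementations avoid the dense shrinking step by keeping a separate scalar for the norm of the weight vector; the sketch trades that efficiency for readability.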

SGD, however, cannot be applied directly in connection with L1 regularization
(and thus the Elastic Net) due to the fluctuations of the approximated gradients.
To overcome this problem different solutions have been proposed, in particular
methods based on truncated gradients [Langford et al. 2009; Tsuruoka et al. 2009]
and projected gradients [Duchi et al. 2008]. In our experiments we employ the
truncated stochastic gradient algorithm proposed by Tsuruoka et al. [2009], which
uses the cumulative L1 penalty to smooth out fluctuations in the approximated
gradients.* Note that Elastic Net regularization is applied for the pivot classifiers
only; all other classifiers are trained using L2 regularization.
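The core of the cumulative-penalty idea can be sketched as follows. The code is a minimal Python illustration of the clipping step as we understand it from Tsuruoka et al. [2009], not an excerpt from the implementation referenced above: `u` is the total L1 penalty every weight could have received so far (it grows by the learning rate times the L1 strength at each step), and `q[i]` is the penalty weight `i` has actually received. The variable names and the numbers in the example are made up.

```python
def apply_cumulative_l1(w, q, u, i):
    """Clip w[i] toward zero by the share of the cumulative penalty u it still owes.

    q[i] records the total (signed) penalty that w[i] has received so far.
    A weight never crosses zero: it is clipped to exactly 0.0 instead.
    """
    z = w[i]
    if w[i] > 0.0:
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0.0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    q[i] += w[i] - z


w = [0.5, -0.05, 0.0]
q = [0.0, 0.0, 0.0]

u = 0.1                      # cumulative penalty after the first update
for i in range(len(w)):
    apply_cumulative_l1(w, q, u, i)
# w[1] would have crossed zero and is clipped to exactly 0.0.

u = 0.2                      # cumulative penalty after the second update
apply_cumulative_l1(w, q, u, 0)
# w[0] only receives the increment 0.1 it has not been charged yet.
```

Because each weight is charged only the penalty it has not yet received, features that occur rarely are still driven to zero when they are next updated, which is what yields sparse weight vectors.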

SGD receives two hyperparameters as input: the number of iterations $T$ and
the regularization parameter $\lambda$. In our experiments $T$ is always set to
$10^6$, which is about the number of iterations required for SGD to converge. For
the target task, $\lambda$ is determined by 3-fold cross-validation, testing for
$\lambda$ all values $10^{-i}$, $i \in [0; 6]$. For the pivot prediction task,
$\lambda$ is set to the small value of $10^{-5}$, in order to favor model accuracy
over generalizability.
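The model selection step for the target task can be sketched as a grid search over the seven candidate values with 3-fold cross-validation. The Python code below is only an illustration under stated assumptions: `train_fn` and `accuracy_fn` are hypothetical callables supplied by the caller, folds are contiguous (so the data should be shuffled beforehand), and ties are resolved in favor of the larger regularization value.

```python
def lambda_grid():
    """Candidate regularization values 10^0, 10^-1, ..., 10^-6."""
    return [10.0 ** -i for i in range(0, 7)]


def k_fold_indices(n, k=3):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for j in range(k):
        size = n // k + (1 if j < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def select_lambda(data, train_fn, accuracy_fn, k=3):
    """Return the grid value with the best mean held-out accuracy."""
    folds = k_fold_indices(len(data), k)
    best_lam, best_acc = None, float("-inf")
    for lam in lambda_grid():
        accs = []
        for held_out in folds:
            held = set(held_out)
            train = [d for idx, d in enumerate(data) if idx not in held]
            valid = [data[idx] for idx in held_out]
            model = train_fn(train, lam)
            accs.append(accuracy_fn(model, valid))
        mean_acc = sum(accs) / len(accs)
        if mean_acc > best_acc:
            best_lam, best_acc = lam, mean_acc
    return best_lam
```

A call such as `select_lambda(train_docs, train_fn, accuracy_fn)` would then be followed by one final training run on all labeled source documents with the selected value.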

Since SGD is sensitive to feature scaling, the projection $\theta \mathbf{x}$ is
post-processed as follows: (1) Each feature of the cross-lingual representation is
standardized to zero mean and unit variance, where mean and variance are estimated
on $D_{\mathcal{S}} \cup D_u$. (2) The cross-lingual document representations are
scaled by a constant $a$ such that
$|D_{\mathcal{S}}|^{-1} \sum_{\mathbf{x} \in D_{\mathcal{S}}} \|a \theta \mathbf{x}\| = 1$.
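The two post-processing steps can be illustrated with a short Python sketch. It assumes dense lists of floats for the projected documents and uses hypothetical function names; `reference` stands for the documents on which mean and variance are estimated, and `labeled` for the labeled source documents over which the average norm is fixed to 1.

```python
import math


def standardize(rows, reference):
    """Standardize each feature using mean and variance estimated on `reference`."""
    k = len(reference[0])
    n = len(reference)
    means = [sum(r[j] for r in reference) / n for j in range(k)]
    stds = []
    for j in range(k):
        var = sum((r[j] - means[j]) ** 2 for r in reference) / n
        stds.append(math.sqrt(var) if var > 0.0 else 1.0)  # guard constant features
    return [[(r[j] - means[j]) / stds[j] for j in range(k)] for r in rows]


def mean_norm_scale(rows):
    """Constant a such that the average Euclidean norm of a * row over `rows` is 1."""
    avg = sum(math.sqrt(sum(v * v for v in r)) for r in rows) / len(rows)
    return 1.0 / avg


reference = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]   # toy projected documents
labeled = reference[:2]

z = standardize(labeled, reference)
a = mean_norm_scale(z)
scaled = [[a * v for v in row] for row in z]
```

Fixing the average norm rather than each individual norm keeps the relative lengths of the projected documents intact while bringing them onto the scale of the unit-length bag-of-words vectors.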

For multi-class classification the one-against-all strategy is applied. For
multi-class problems, the set of pivot candidates $V_P$ is formed as follows:
(1) rank for each class the words according to mutual information with respect to
all other classes, and (2) select from each ranking those words with the highest
mutual information.
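The two-step candidate selection can be sketched as follows. The Python code is an illustration under assumptions: mutual information is computed between the binary event "word occurs in the document" and the binary event "document belongs to the class" (class versus rest), the number of words taken per class (`per_class`) is a hypothetical parameter, and ties are broken alphabetically. Later filtering of candidates, for example by support, is not shown.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """MI (in nats) of two binary variables from their 2x2 contingency counts.

    n11: word present, class c     n10: word present, other class
    n01: word absent,  class c     n00: word absent,  other class
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_wc, n_w, n_c in (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ):
        if n_wc > 0:
            mi += (n_wc / n) * math.log(n * n_wc / (n_w * n_c))
    return mi


def pivot_candidates(docs, labels, per_class):
    """Rank words per class by MI (class vs. rest) and keep the top ones."""
    classes = sorted(set(labels))
    doc_sets = [set(d) for d in docs]
    vocab = sorted(set().union(*doc_sets))
    n = len(docs)
    selected = []
    for c in classes:
        n_c = sum(1 for y in labels if y == c)
        scored = []
        for w in vocab:
            n11 = sum(1 for d, y in zip(doc_sets, labels) if w in d and y == c)
            n10 = sum(1 for d in doc_sets if w in d) - n11
            n01 = n_c - n11
            n00 = n - n11 - n10 - n01
            scored.append((mutual_information(n11, n10, n01, n00), w))
        scored.sort(key=lambda s: (-s[0], s[1]))
        for _, w in scored[:per_class]:
            if w not in selected:
                selected.append(w)
    return selected


docs = [["plot", "novel"], ["novel", "author"], ["film", "actor"],
        ["film", "scene"], ["album", "song"], ["album", "band"]]
labels = ["books", "books", "dvd", "dvd", "music", "music"]
candidates = pivot_candidates(docs, labels, per_class=1)
```

In the toy corpus each class has exactly one word that occurs in all of its documents and in no others, so that word receives the maximal score for its class.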



*Our implementation is available at http://github.com/pprett/bolt 
We use the bilingual dictionary provided by Prettenhofer and Stein [2010] as
word translation oracle. If the source word is not contained in the dictionary we
resort to Google Translate, which returns a single translation for each query word.^
Note that the word translation oracle operates context-free, which is suboptimal; 
however, we do not sanitize the translations to demonstrate the robustness of CL- 
SCL with respect to translation noise. 
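The lookup logic of such an oracle is simple enough to sketch. In the Python code below the fallback is a stub that stands in for a machine translation service; no real translation API is called, and the function names, the caching of fallback results, and the example entries are assumptions made for illustration.

```python
def make_oracle(dictionary, fallback):
    """Build a context-free word translation oracle.

    `dictionary` maps source words to target words; `fallback` is any callable
    that returns a single translation for a word missing from the dictionary.
    """
    cache = dict(dictionary)

    def translate(word):
        if word not in cache:
            cache[word] = fallback(word)   # one translation per query word
        return cache[word]

    return translate


calls = []


def stub_fallback(word):
    """Hypothetical stand-in for a translation service."""
    calls.append(word)
    return word.upper()


oracle = make_oracle({"beautiful": "schön", "boring": "langweilig"}, stub_fallback)
known = oracle("beautiful")
unknown = oracle("plot")
again = oracle("plot")        # served from the cache, no second fallback call
```

Since the oracle sees single words without context, an ambiguous source word always maps to the same target word, which is the source of the translation noise mentioned above.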

5.3 Upper Bound and Baseline 

To get an upper bound on the performance of a cross-language method we first 
consider the monolingual setting. For each task a linear classifier is learned on 
the training set of the target language and tested on the test set. The resulting 
accuracy scores are referred to as the upper bound; this bound informs us about the
expected performance on the target task if training data in the target language is 
available. 

We choose a machine translation baseline to compare CL-SCL to another cross- 
language method. Statistical machine translation technology offers a straightfor- 
ward solution to the problem of cross-language text classification and has been 
used in a number of cross-language sentiment classification studies [Hiroshi et al. 
2004; Bautin et al. 2008; Wan 2009]. Our baseline CL-MT is determined as follows:
(1) learn a linear classifier on the training data, (2) translate the test
documents into the source language, and (3) predict the sentiment polarity of the
translated test documents.

Translations of the test documents into the source language via Google Translate
are provided by Prettenhofer and Stein [2010]. Note that the baseline CL-MT does 
not make use of unlabeled documents. 

5.4 Experimental Results 

Table II contrasts the classification performance of CL-SCL with the upper bound 
and the baseline. Due to the inherent randomness of the training algorithm, we
report the accuracy scores as mean $\mu$ and standard deviation $\sigma$ of ten
repetitions of SGD. We use McNemar's test to analyze whether the differences between
the results of CL-SCL and CL-MT are statistically significant [Dietterich 1998].
Again, due to the randomness of the training algorithm, statistical significance is
analyzed for each of the ten repetitions, and significance at a specific level is
reported only if it applies to all repetitions.
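The significance protocol can be sketched in Python as follows. The code uses the continuity-corrected chi-square form of McNemar's test with one degree of freedom; whether the experiments used exactly this variant is an assumption, and the prediction vectors in the example are invented.

```python
import math


def mcnemar(y_true, pred_a, pred_b):
    """Return (chi-square statistic, p-value) of McNemar's test for two classifiers."""
    only_a = sum(1 for y, a, b in zip(y_true, pred_a, pred_b) if a == y and b != y)
    only_b = sum(1 for y, a, b in zip(y_true, pred_a, pred_b) if a != y and b == y)
    if only_a + only_b == 0:
        return 0.0, 1.0                      # the classifiers never disagree
    stat = max(abs(only_a - only_b) - 1, 0) ** 2 / (only_a + only_b)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def significant_in_all_runs(runs, level):
    """Report significance only if it holds in every repetition."""
    return all(mcnemar(y, a, b)[1] < level for y, a, b in runs)


y = [1] * 20
scl = [1] * 20                 # hypothetical predictions, all correct
mt = [1] * 10 + [-1] * 10      # hypothetical predictions, ten errors
stat, p = mcnemar(y, scl, mt)
runs = [(y, scl, mt)] * 10     # ten identical toy repetitions
```

Only the documents on which the two classifiers disagree enter the statistic, which makes the test suitable for comparing two methods on the same test set.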

Observe that the upper bound does not exhibit high variability across the three 
languages. For sentiment classification the average accuracy is about 82%, which 
is consistent with prior work on monolingual sentiment analysis [Pang et al. 2002; 
Blitzer et al. 2007]. For product category classification the average accuracy is 
in the low 90's, which is also consistent with prior work on monolingual product 
category classification [Crammer et al. 2009]. 

The performance of CL-MT, however, differs considerably between the two European
languages and Japanese: for Japanese, the averaged differences between the upper
bound and CL-MT (9.5% for sentiment and 7.3% for topic classification) are much
larger than for German and French (5.3% and 1.7%). This can be explained by the
fact that machine translation works better for European than for Asian languages
such as Japanese.

^http://translate.google.com

Table II. Cross-language sentiment and topic classification results.

                  Upper Bound      CL-MT                      CL-SCL
T         Cat.    μ       σ        μ       σ       Δ          μ       σ       Δ       RR[%]
German    books   83.79   ±0.20    79.68   ±0.13   4.11    †  83.34   ±0.02   0.45    89.05
          dvd     81.78   ±0.27    77.92   ±0.25   3.86    †  80.89   ±0.02   0.89    76.94
          music   82.80   ±0.13    77.22   ±0.23   5.58    †  82.90   ±0.00   -0.10   101.79
French    books   83.92   ±0.14    80.76   ±0.34   3.16       81.27   ±0.08   2.65    16.14
          dvd     83.40   ±0.28    78.83   ±0.19   4.57       80.43   ±0.05   2.97    35.01
          music   86.09   ±0.13    75.78   ±0.65   10.31      78.05   ±0.06   8.04    22.02
Japanese  books   78.09   ±0.14    70.22   ±0.27   7.87    †† 77.00   ±0.06   1.09    86.15
          dvd     81.56   ±0.28    71.30   ±0.28   10.26   †† 76.37   ±0.05   5.19    49.42
          music   82.33   ±0.13    72.02   ±0.29   10.31   †† 77.34   ±0.06   4.99    51.60
German    -       92.95   ±0.11    92.25   ±0.07   0.70       92.61   ±0.06   0.34    51.43
French    -       93.27   ±0.07    90.58   ±0.17   2.69       90.57   ±0.13   2.70    -0.37
Japanese  -       89.43   ±0.11    82.14   ±0.22   7.29    †† 85.03   ±0.10   4.40    39.64

Evaluation results for sentiment classification (first nine rows) and topic
classification (last three rows). Accuracy scores (mean μ and standard deviation σ
of 10 repetitions of SGD) on the test set of the target language T are reported.
Δ gives the difference in accuracy to the upper bound. Statistical significance
(McNemar) of CL-SCL is measured against CL-MT († indicates 0.05 and †† 0.001).
RR gives the relative reduction in error over CL-MT. For sentiment classification,
CL-SCL uses m = 450, k = 100, φ = 30, and α = 0.85. For topic classification,
CL-SCL uses m = 250, k = 50, φ = 50, and α = 0.85.

Recall that CL-SCL receives four hyperparameters as input: the number of pivots
$m$, the dimensionality of the cross-lingual representation $k$, the minimum support
$\phi$ of a pivot in $D_{\mathcal{S},u}$ and $D_{\mathcal{T},u}$, and the Elastic Net
coefficient $\alpha$. For cross-language sentiment classification we use fixed values
of $m = 450$, $k = 100$, $\phi = 30$, and $\alpha = 0.85$. For cross-language topic
classification we found that smaller values of $m$ and $k$ work significantly better.
The results for topic classification are obtained by using fixed values of
$m = 250$, $k = 50$, $\phi = 50$, and $\alpha = 0.85$. The parameter settings have
been optimized using the German book review task (sentiment) and the German task
(topic).

The results show that CL-SCL either outperforms CL-MT or is at least competitive
across all tasks. For German and Japanese sentiment classification we observe
significant differences at the 0.05 and 0.001 significance levels, respectively.
For product category classification we observe significant differences only for
Japanese (0.001 significance level). Interestingly, for German music reviews, the
accuracy of CL-SCL even
surpasses the upper bound; this can be interpreted as a semi-supervised learning 
effect that stems from the massive use of unlabeled data. The rightmost column 
of Table II shows the relative reduction in error due to cross-lingual adaptation 
of CL-SCL over CL-MT. CL-SCL reduces the relative error by an average of 59% 
(sentiment classification) and 30% (topic classification) over CL-MT. 
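The relative reduction in error can be recomputed from the Δ columns of Table II as $RR = 100 \cdot (\Delta_{MT} - \Delta_{SCL}) / \Delta_{MT}$. The following Python snippet is a small check of this arithmetic; the formula is inferred from the table values, and the function name is chosen for illustration.

```python
def relative_reduction(delta_mt, delta_scl):
    """Relative reduction (in percent) of the gap to the upper bound."""
    return 100.0 * (delta_mt - delta_scl) / delta_mt


# (delta CL-MT, delta CL-SCL) pairs copied from Table II.
sentiment = [(4.11, 0.45), (3.86, 0.89), (5.58, -0.10),     # German
             (3.16, 2.65), (4.57, 2.97), (10.31, 8.04),     # French
             (7.87, 1.09), (10.26, 5.19), (10.31, 4.99)]    # Japanese
topic = [(0.70, 0.34), (2.69, 2.70), (7.29, 4.40)]

rr_sentiment = [relative_reduction(mt, scl) for mt, scl in sentiment]
rr_topic = [relative_reduction(mt, scl) for mt, scl in topic]

mean_sentiment = sum(rr_sentiment) / len(rr_sentiment)
mean_topic = sum(rr_topic) / len(rr_topic)
```

A value above 100, as for German music reviews, means that CL-SCL exceeds the upper bound; a negative value, as for French topic classification, means that CL-MT is slightly closer to it.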

5.5 Sensitivity Analysis 

CL-SCL receives a number of hyperparameters as input; the purpose of this section 
is to elaborate on each hyperparameter. In the following, we will analyze the 
sensitivity of each hyperparameter in isolation while keeping the others fixed. If not 
specified otherwise, we use the same setting of the hyperparameters as in Table II. 
Fig. 4. Influence of unlabeled data and hyperparameters on the performance of
CL-SCL. The rows show the performance of CL-SCL as a function of (1) the ratio
between labeled and unlabeled documents, (2) the number of pivots $m$, and (3) the
dimensionality of the cross-lingual representation $k$.

Unlabeled Data. The first row of Figure 4 shows the performance of CL-SCL as a
function of the ratio of unlabeled to labeled documents for sentiment classification
of book reviews. A ratio of 1 means that
$|D_{\mathcal{S},u}| = |D_{\mathcal{T},u}| = 2{,}000$, while a ratio of 25
corresponds to the setting of Table II. As expected, an increase in the number
of unlabeled documents results in an improved performance. However, a saturation
at a ratio of 10 can be observed across most tasks.

Number of Pivots. The second row shows the influence of the number of pivots
$m$ on the performance of CL-SCL. Compared to the size of the vocabularies
$V_{\mathcal{S}}$ and $V_{\mathcal{T}}$, which is on the order of $10^5$, the number
of pivots is very small. The plots show that even a small number of pivots captures
a significant amount of the correspondence between $\mathcal{S}$ and $\mathcal{T}$.

Dimensionality of the Cross-Lingual Representation. The third row shows the
influence of the dimensionality of the cross-lingual representation $k$ on the
performance of CL-SCL. Obviously the SVD is crucial to the success of CL-SCL if
$m$ is sufficiently large. Observe that the value of $k$ is task-insensitive: a
value of $50 < k < 150$ works equally well across all tasks.
Table III. Effect of regularization.

                    L2+              L1               Elastic Net
T         Category  μ       d[%]     μ       d[%]     μ       d[%]
German    books     79.50   17.88    82.45   1.24     83.34   11.02
          dvd       77.06   16.84    78.60   1.43     80.89   12.25
          music     77.60   16.00    81.41   1.72     82.90   13.92
French    books     79.02   16.50    80.75   1.87     81.27   14.13
          dvd       78.80   19.23    78.70   3.98     80.43   23.22
          music     77.72   16.70    77.32   3.72     78.05   21.60
Japanese  books     73.09   15.21    71.06   1.27     77.00   10.47
          dvd       71.10   14.86    75.75   1.48     76.37   11.84
          music     75.15   13.72    76.22   1.83     77.34   13.39
German    -         89.69   16.19    88.73   0.92     92.61   8.38
French    -         87.59   16.29    89.65   1.36     90.57   11.37
Japanese  -         82.83   16.71    84.26   1.23     85.03   10.15

The effect of different regularization terms on the performance of CL-SCL for
cross-language sentiment (first nine rows) and topic classification (last three
rows). d gives the density of the parameter matrix W, i.e., the number of non-zero
entries divided by the total number of entries. W is 450 × |V|, where |V| is on the
order of 10^5 (see Table I for details). Elastic Net uses α = 0.85.



Effect of Regularization. Table III compares the effect of three different
regularization terms on the performance of CL-SCL. The third column, L2+, refers to
the strategy in [Blitzer et al. 2006] and [Prettenhofer and Stein 2010] with
ordinary L2 regularization and negative weights set to zero. The fifth column shows
the performance of L1 regularization. Observe that L1 regularization drastically
reduces the number of non-zero features, from 16% to 2% on average. We argued
in Section 4.2 that L1 regularization is not adequate due to its improper handling
of highly correlated features and we proposed the Elastic Net penalty as an
alternative. The empirical evidence supports this claim: Elastic Net regularization
consistently outperforms both L2+ and L1 regularization while keeping the number
of non-zero features low (15% on average). Note that Elastic Net regularization
adds an additional hyperparameter $\alpha$ that trades off the relative importance
of L2 and L1 regularization. In the above experiments the value of $\alpha$ is chosen
such that the obtained density roughly equals the density of L2+. A convenient
property of the Elastic Net is that it encompasses L2 and L1 regularization as
special cases ($\alpha = 1$ and $\alpha = 0$, respectively). Thus, if $m$ and $|V|$
are sufficiently small and a dense SVD is computationally feasible, $\alpha = 1$ is
optimal. Otherwise, the optimal choice of $\alpha$ is governed by the available
computing resources.
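The quantities compared in Table III can be illustrated with a few lines of Python. The sketch is not taken from the experiments: the weight matrix is a toy example, the function names are hypothetical, and the penalty is written in one plausible parameterization, $\alpha \|\mathbf{w}\|_2^2 + (1 - \alpha) \|\mathbf{w}\|_1$, chosen only because it reduces to L2 for $\alpha = 1$ and to L1 for $\alpha = 0$; the exact form is the one given in Section 4.2.

```python
def l2_plus(W):
    """L2+ strategy: keep an L2-trained matrix but set negative weights to zero."""
    return [[v if v > 0.0 else 0.0 for v in row] for row in W]


def density(W):
    """Percentage of non-zero entries in the parameter matrix W."""
    total = sum(len(row) for row in W)
    nonzero = sum(1 for row in W for v in row if v != 0.0)
    return 100.0 * nonzero / total


def elastic_net_penalty(w, alpha):
    """Assumed parameterization: alpha weights the L2 term, (1 - alpha) the L1 term."""
    l2 = sum(v * v for v in w)
    l1 = sum(abs(v) for v in w)
    return alpha * l2 + (1.0 - alpha) * l1


# Toy matrix with two pivot classifiers (rows) over four features (columns).
W = [[0.5, -0.2, 0.0, 0.1],
     [-0.3, 0.0, 0.4, -0.1]]
d_full = density(W)
d_plus = density(l2_plus(W))
```

The density matters because W is decomposed by SVD: the sparser the matrix, the cheaper the decomposition for large $m$ and $|V|$.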

The use of Elastic Net regularization to obtain sparse pivot classifiers has impli- 
cations beyond CL-SCL, in particular for the application of Alternating Structural 
Optimization [Ando and Zhang 2005b] and Structural Correspondence Learning 
[Blitzer et al. 2006] in high dimensional feature spaces. 

5.6 Interpretation of Results 

Primarily responsible for the effectiveness of CL-SCL is its task specificity, i.e.,
the way in which context contributes to meaning (pragmatics). Due to the use
of task-specific, unlabeled data, relevant characteristics are captured by the pivot
classifiers.

Table IV. Semantic and pragmatic correlations.

Pivot {beautiful_S, schön_T}
  English, semantics:   amazing, beauty, lovely
  English, pragmatics:  picture, pattern, poetry, photographs, paintings
  German, semantics:    schöner (more beautiful), traurig (sad)
  German, pragmatics:   bilder (pictures), illustriert (illustrated)

Pivot {boring_S, langweilig_T}
  English, semantics:   plain, asleep, dry, long
  English, pragmatics:  characters, pages, story
  German, semantics:    langatmig (lengthy), einfach (plain), enttäuscht (disappointed)
  German, pragmatics:   charaktere (characters), handlung (plot), seiten (pages)

Semantic and pragmatic correlations identified for the two pivots
{beautiful_S, schön_T} and {boring_S, langweilig_T} in English and German book
reviews.

Table IV exemplifies this claim with two pivots for German book reviews. The
rows of the table show a selection of words which have the highest correlation with
the pivots {beautiful_S, schön_T} and {boring_S, langweilig_T}. We can distinguish
between (1) correlations that reflect similar meaning, such as "amazing", "lovely",
or "plain", and (2) correlations that reflect the pivot pragmatics with respect to
the task, such as "picture", "poetry", or "pages".

Note in this connection that the authors of book reviews tend to use the word 
"beautiful" to refer to illustrations or to poetry, and that they use the word "pages" 
to indicate lengthy or boring books. While the first type of word correlations can 
be obtained by methods that operate on parallel corpora, the second correlation 
type requires an understanding of the task-specific language use. 
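One way to obtain word lists like those in Table IV is to rank the vocabulary by the weights of a pivot classifier. The Python sketch below assumes that "correlation with a pivot" is read off from these weights; the weights and the tiny vocabulary are invented for illustration and do not reproduce the values behind Table IV.

```python
def top_correlated(weights, vocabulary, n=3):
    """Return the n words with the largest positive weight for one pivot classifier."""
    ranked = sorted(zip(weights, vocabulary), key=lambda pair: -pair[0])
    return [word for weight, word in ranked[:n] if weight > 0.0]


# Hypothetical weights of the pivot classifier for {beautiful_S, schön_T}.
vocabulary = ["amazing", "pages", "picture", "lovely", "boring"]
weights = [0.9, 0.0, 0.7, 0.8, -0.4]
top = top_correlated(weights, vocabulary)
```

Sorting a list like this into semantic and pragmatic correlations, as done in Table IV, remains a manual step.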

6. CONCLUSIONS 

We have presented Cross-Language Structural Correspondence Learning, CL-SCL, 
as an effective technology for cross-lingual adaptation. CL-SCL builds on Structural 
Correspondence Learning, a recently proposed algorithm for domain adaptation 
in natural language processing. CL-SCL uses unlabeled documents along with a 
feature translation oracle to automatically induce task-specific, cross-lingual feature 
correspondences. 

We evaluated the approach for cross-language text classification, a special case 
of cross-lingual adaptation. The analysis covers performance and sensitivity issues 
in the context of sentiment and topic classification with English as source language 
and German, French, and Japanese as target languages. The results show a signif- 
icant improvement of the proposed approach over a machine translation baseline, 
reducing the relative error due to cross-lingual adaptation by an average of 59% 
(sentiment classification) and 30% (topic classification) over the baseline. 

Furthermore, the Elastic Net is proposed as an effective means to obtain a sparse 
parameter matrix, again leading to a significant improvement upon previously
reported results. Note that the Elastic Net has implications beyond CL-SCL, in
particular for
Structural Correspondence Learning [Blitzer et al. 2006] and Alternating Structural 
Optimization [Ando and Zhang 2005a]. 